Available data set and its variables:
## Observations: 1,012
## Variables: 51
## $ max_bid <int> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ min_bid <int> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ primary_tracking_source <fct> fr8hub, fr8hub, fr8hub, fr8hub, fr...
## $ no_bids_refused <int> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fh_commission <dbl> 875.00, 625.00, 1075.00, 600.00, 5...
## $ shipper_ask_price <dbl> 4375, 3125, 5375, 4000, 2625, 1500...
## $ distance <dbl> 1780.3230, 1517.9297, 2177.3311, 1...
## $ duration <int> 97020, 95100, 136440, 102360, 8652...
## $ delivery_scheduled_until <fct> 2017-06-22 15:00:00, 2017-07-10 19...
## $ delivery_scheduled_at_month <fct> 2017-06-01 0:00:00, 2017-07-01 0:0...
## $ delivery_scheduled_at <fct> 2017-06-22 12:00:00, 2017-07-10 12...
## $ pickup_scheduled_at <fct> 2017-06-19 18:00:00, 2017-07-07 18...
## $ is_hazmat <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ carrier_closed_price <dbl> 3500.000, 2500.000, 4300.000, 2687...
## $ load_description <fct> consumer goods, pipes , building m...
## $ load_value <dbl> 100000, 100000, 100000, 100000, 10...
## $ load_equipment <fct> Dry Van, Dry Van, Flatbed, Flatbed...
## $ destination_point <fct> 38.8645105, -76.7279378, 41.006085...
## $ destination_point_lat <dbl> 38.86451, 41.00608, 47.68264, 41.3...
## $ destination_point_long <dbl> -76.72794, -83.64345, -117.40863, ...
## $ destination_postal_code <int> 20774, 45840, 99207, 44139, 78045,...
## $ destination_state <fct> Maryland, Ohio, Washington, Ohio, ...
## $ destination_city <fct> Upper Marlboro, Findlay, Spokane, ...
## $ destination_country <fct> United States, United States, Unit...
## $ origin_point <fct> 27.617417, -99.523012, 27.5996843,...
## $ origin_point_lat <dbl> 27.61742, 27.59968, 27.61715, 27.6...
## $ origin_point_long <dbl> -99.52301, -99.49736, -99.47522, -...
## $ origin_postal_code <int> 78045, 78045, 78045, 78045, 60618,...
## $ origin_state <fct> Texas, Texas, Texas, Texas, Illino...
## $ origin_city <fct> Laredo, Laredo, Laredo, Laredo, Ch...
## $ origin_country <fct> United States, United States, Unit...
## $ carrier_name <fct> Falcon Transport Inc, Falcon Trans...
## $ shipper_closed_price <dbl> 4375.000, 3125.000, 5375.000, 3287...
## $ status <fct> completed, completed, completed, c...
## $ is_multistop <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ id <fct> 55ade5c6-5512-11e7-9d6d-0a580a0005...
## $ shipper_id <fct> 95ef81e4-3f50-11e7-8ffc-0a580a0001...
## $ shipment_no <int> 115, 140, 141, 159, 133, 294, 251,...
## $ matched_at <fct> 2017-06-19 17:17:32, 2017-07-07 16...
## $ is_cross_border <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ carrier_id <fct> 0e38bd80-3efe-11e7-8a0f-0a580a0001...
## $ is_completed <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE...
## $ is_dropped <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ unloaded_at <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ posted_at <fct> 2017-06-19 17:11:30, 2017-07-07 16...
## $ is_halted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ shipper_name <fct> Ventus Freight LLC, Ventus Freight...
## $ carrier_winning_bid <int> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ diffhourS <dbl> 3, 7, 9, 0, 2, 2, 9, 0, 4, 10, 9, ...
## $ diffhourSP <dbl> 66.0, 66.0, 92.0, 116.0, 90.0, 40....
## $ diffhourMP <dbl> 0.100555556, 0.047777778, 0.035000...
Note that three new variables have been created: diffhourS, diffhourSP and diffhourMP, each calculating a time difference in hours.
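As a rough check of how these derived variables could be computed, the sketch below reproduces them for the first observation only, using the timestamps printed in the listing above and the definitions given later in the report. The helper `hours_between` and the UTC time zone are assumptions made for illustration; the report does not show the code actually used.

```r
# Timestamps of the first observation, copied from the variable listing above
obs1 <- list(
  delivery_scheduled_until = "2017-06-22 15:00:00",
  delivery_scheduled_at    = "2017-06-22 12:00:00",
  pickup_scheduled_at      = "2017-06-19 18:00:00",
  matched_at               = "2017-06-19 17:17:32",
  posted_at                = "2017-06-19 17:11:30"
)

# Hypothetical helper: hours elapsed between two timestamp strings.
# The time zone is assumed to be UTC; the report does not state it.
hours_between <- function(later, earlier) {
  later   <- as.POSIXct(later,   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  earlier <- as.POSIXct(earlier, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  as.numeric(difftime(later, earlier, units = "hours"))
}

diffhourS  <- hours_between(obs1$delivery_scheduled_until, obs1$delivery_scheduled_at)
diffhourSP <- hours_between(obs1$delivery_scheduled_at, obs1$pickup_scheduled_at)
diffhourMP <- hours_between(obs1$matched_at, obs1$posted_at)

c(diffhourS = diffhourS, diffhourSP = diffhourSP, diffhourMP = diffhourMP)
```

The three results (3, 66 and about 0.1006 hours) agree with the first values listed for diffhourS, diffhourSP and diffhourMP.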
Looking at the available variables, we assume that fh_commission is the key variable of interest and that it is derived as \[fh\_commission = shipper\_closed\_price - carrier\_closed\_price\] We will start by examining the response variable fh_commission, looking at its distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1750.0 100.0 200.0 247.5 388.5 1350.0
There is a long ‘tail’ to the left caused by a single extremely negative value (\(-1750\)). The majority of the observations are centered around the value of \(200\), and apart from that value the data is slightly right skewed (the mean, \(247.5\), exceeds the median, \(200\)).
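The assumed derivation of fh_commission can be checked directly against the data. The sketch below does this for the first three observations only, with values copied from the variable listing; the data frame name `first_rows` is hypothetical.

```r
# First three observations, copied from the variable listing above
first_rows <- data.frame(
  shipper_closed_price = c(4375, 3125, 5375),
  carrier_closed_price = c(3500, 2500, 4300),
  fh_commission        = c(875, 625, 1075)
)

# Commission implied by the assumed identity
derived <- first_rows$shipper_closed_price - first_rows$carrier_closed_price

# TRUE if the implied and recorded commissions agree (up to rounding error)
matches_recorded <- isTRUE(all.equal(derived, first_rows$fh_commission))
matches_recorded
```

Running the same comparison on the full columns of mydata would show whether the identity holds for every observation or only for some.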
The questions that need addressing:
fh_commissions?
We should identify the extreme observations:
which(mydata$fh_commission < -500)
## [1] 203
which(mydata$fh_commission > 1000)
## [1] 3 217 311 315
Since this variable is a linear combination of shipper_closed_price and carrier_closed_price it would be useful to examine the correlation between those three variables.
If we take a look once again, but this time without the extreme negative fh_commission:
we will notice a very strong relationship between shipper_closed_price and carrier_closed_price. Should we model them? If we stick with fh_commission as our key variable of interest, would it be appropriate to use shipper_closed_price and carrier_closed_price as explanatory variables in the predictive model? They would be unknown at the time of prediction, and we would in effect be predicting them indirectly. The answer is clearly no.
For that reason shipper_closed_price and carrier_closed_price will not be considered for our model.
Let us look at bivariate relationships between response variable fh_commission and potential explanatory variables.
shipper_ask_price
fh_commission vs shipper_ask_price: M v M
To explore the relationship between measured type variables we fit a regression model for a given response y and the explanatory variable x: \[ y = b_0 + b_1x + e, \] where e is the error term (the part of the variability in y that is not explained by the fitted model, i.e. by the explanatory variable(s)).
First we’ll look at the summary of the explanatory variable shipper_ask_price.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 90 1600 2572 2680 3125 100000
It would be useful to identify the observations with very high shipper_ask_price above 20K
## [1] 17 106 524 550 573 578 754
and look at the spread of the data once again without the observations above this threshold.
The key question is how important is the shipper_ask_price variable in explaining the variability in the fh_commission. To analyse this we fit a simple regression model: \[fh\_commission = b_0 + b_1shipper\_ask\_price\]
##
## Call:
## lm(formula = fh_commission ~ shipper_ask_price, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2017.7 -146.9 -47.6 125.7 1076.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.301e+02 8.466e+00 27.175 < 2e-16 ***
## shipper_ask_price 6.491e-03 1.741e-03 3.728 0.000204 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 224.7 on 1010 degrees of freedom
## Multiple R-squared: 0.01358, Adjusted R-squared: 0.0126
## F-statistic: 13.9 on 1 and 1010 DF, p-value: 0.0002035
It appears to be a statistically significant relationship, even though the model accounts for only \(1.36\%\) of variability in fh_commission (\(R^2=1.36\%\)).
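The quantities quoted here, \(R^2\) and the slope's \(p\) value, can be read off the summary of a fitted lm object. The sketch below uses simulated data rather than the report's data; the variable names `x` and `y` and the simulated coefficients are assumptions chosen only to resemble a weak linear signal buried in a lot of noise.

```r
set.seed(1)

# Simulated data: a small slope relative to the noise level
n <- 1000
x <- runif(n, min = 0, max = 5000)
y <- 230 + 0.0065 * x + rnorm(n, sd = 225)

fit         <- lm(y ~ x)
fit_summary <- summary(fit)

# Proportion of variability in y explained by the model
r_squared <- fit_summary$r.squared

# p value of the t-test for the slope
slope_p <- coef(fit_summary)["x", "Pr(>|t|)"]

# In simple linear regression R^2 equals the squared Pearson correlation
all.equal(r_squared, cor(x, y)^2)
```

Statistical significance and explanatory power answer different questions: with about a thousand observations even a small slope can be distinguished from zero while explaining very little of the variability.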
distance
fh_commission vs distance: M v M
We will do the same analysis as above:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.41 766.21 1228.19 1153.39 1506.17 2545.01
Distribution of distance doesn’t present any issues.
##
## Call:
## lm(formula = fh_commission ~ distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2206.26 -130.93 -29.44 112.90 964.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.80971 16.68203 4.125 4.02e-05 ***
## distance 0.15490 0.01325 11.687 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 212.4 on 1010 degrees of freedom
## Multiple R-squared: 0.1191, Adjusted R-squared: 0.1182
## F-statistic: 136.6 on 1 and 1010 DF, p-value: < 2.2e-16
This is a statistically significant relationship with \(R^2=11.91\%\).
duration
fh_commission vs duration: M v M
We expect to get very similar outcomes as for the previous analysis of fh_commission vs distance, since duration and distance are highly correlated.
cor(mydata$distance, mydata$duration)
## [1] 0.9894508
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 720 53625 77340 75253 96135 186360
The distribution of duration doesn’t present any issues.
##
## Call:
## lm(formula = fh_commission ~ duration, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2182.92 -130.42 -29.52 114.82 1020.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.516e+01 1.735e+01 4.332 1.62e-05 ***
## duration 2.290e-03 2.124e-04 10.778 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 214.3 on 1010 degrees of freedom
## Multiple R-squared: 0.1032, Adjusted R-squared: 0.1023
## F-statistic: 116.2 on 1 and 1010 DF, p-value: < 2.2e-16
This is a statistically significant relationship with \(R^2=10.32\%\).
delivery_scheduled_until: month, day
delivery_scheduled_until slows down over the summer period and does not show any pattern over the monthly period.
delivery_scheduled_at: month, day
We’ll do the same analysis of delivery_scheduled_at.
delivery_scheduled_until.
pickup_scheduled_at: month, day
is_hazmat
fh_commission vs is_hazmat: M v A(2)
This is a ‘measured vs. attribute’ type of problem. We split the measured variable into subgroups according to the levels of the attribute variable to assess the similarity of the subdistributions. The question we want to answer is: Are the subdistributions the same or different?
A graphical visualisation of the two subdistributions might help to answer this question! Boxplots are the standard graphical method of displaying this sort of data. The boxplot display enables the difference in ‘means’ to be assessed relative to the spreads (variability) of the subdistributions.
There are only a few
## [1] 6
observations with is_hazmat = TRUE, so this variable would not be suitable for modelling. Nonetheless, it is interesting to notice that all the fh_commission values for is_hazmat = TRUE are around their average value:
## # A tibble: 2 x 2
## `as.factor(is_hazmat)` mean_fh
## <fct> <dbl>
## 1 FALSE 247.
## 2 TRUE 405.
## [1] 300.00 400.00 353.25 566.25 400.00 408.00
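The group mean reported for is_hazmat = TRUE can be reproduced from the six values printed above. The sketch below uses base R rather than whatever produced the tibble; the two FALSE values in the toy data frame are invented purely to show the grouping step.

```r
# The six fh_commission values with is_hazmat = TRUE, copied from the output above
hazmat_fh <- c(300.00, 400.00, 353.25, 566.25, 400.00, 408.00)

mean_hazmat <- mean(hazmat_fh)  # 2427.5 / 6
round(mean_hazmat)              # matches the rounded mean_fh shown for TRUE

# The same calculation by group; the FALSE values are illustrative only
toy <- data.frame(
  is_hazmat     = c(rep(TRUE, 6), FALSE, FALSE),
  fh_commission = c(hazmat_fh, 200, 100)
)
group_means <- sapply(split(toy$fh_commission, toy$is_hazmat), mean)
group_means
```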
load_description
fh_commission vs load_description: M v A(>2)
Let us look at the summary of load_description, but before we do that we will first check the number of possible outcomes:
## [1] 187
There are \(187\) levels all together, so we will look only at the first 21 possible outcomes to get the feel for the information it provides:
## Polymer Beads Sheet Metal Aluminum
## 392 53 42
## Polymer Beads Wire CHIPS
## 40 26 23
## Large Paper Rolls polymer beads carton
## 19 18 15
## Polymerbeads Crane Plastic
## 15 13 13
## polymer beads Polymer beads Beer
## 12 12 11
## consumer goods Small Appliances Steel Plates
## 11 9 9
## SWIG ITEM KS TOWEL; Autoparts steel
## 9 7 7
It looks very messy!!! 😮 This variable would definitely need to be tidied up before it could be considered for any analysis. Time is needed for organising the data into a suitable format and for acquiring the skills required for using ‘regular expressions’ 😬. This would involve getting rid of punctuation symbols and developing consistency in the use of singular or plural forms, spaces, capital letters, etc. For example, we should aim to get something like this, but tidier and better:
## POLYMERBEADS SHEETMETAL ALUMINUM CHIPS
## 495 53 47 26
## WIRE LARGEPAPERROLLS CARTON BEER
## 26 20 18 16
## STEEL CONSUMERGOODS CRANE PLASTIC
## 14 13 13 13
## PEANUTS SMALLAPPLIANCES STEELPLATES SWIGITEMKSTOWEL
## 9 9 9 9
## AUTOPARTS MACHINEPARTS SWINGITEMKSTOWEL FIBERGLASS
## 8 6 6 5
## [1] 152
This does look better, but it still has over \(150\) levels. This variable would need some considerable attention before it could be deemed suitable for modelling.
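A first tidying pass of the kind described above might look like the following sketch. It assumes that upper-casing and dropping every non-letter character is acceptable, which is one plausible reading of the tidied labels shown above and not necessarily the code used for the report; `clean_label` is a hypothetical helper.

```r
# A few raw labels in the style of the load_description output above
raw <- c("Polymer Beads", "polymer beads", "Polymerbeads",
         "Polymer beads ", "SWIG ITEM KS TOWEL;", "Sheet Metal")

# Hypothetical helper: remove everything that is not a letter, then upper-case
clean_label <- function(x) {
  toupper(gsub("[^[:alpha:]]", "", x))
}

cleaned <- clean_label(raw)
table(cleaned)
```

This merges variants that differ only in case, spacing or punctuation. Variants that differ by a typing error or by singular versus plural would still need a manual mapping or some form of approximate matching.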
load_value
fh_commission vs load_value: M v M
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10 45000 50000 61114 80000 600001
Let us see the spread without the ‘outliers’ identified on the boxplot at the positions
## [1] 168 347
## [1] 600001 150000
##
## Call:
## lm(formula = fh_commission ~ load_value, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2001.00 -148.92 -48.92 140.96 1107.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.554e+02 1.517e+01 16.835 <2e-16 ***
## load_value -1.302e-04 2.193e-04 -0.593 0.553
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 226.2 on 1010 degrees of freedom
## Multiple R-squared: 0.0003486, Adjusted R-squared: -0.0006411
## F-statistic: 0.3522 on 1 and 1010 DF, p-value: 0.553
The relationship is not statistically significant with \(R^2=0.03\%\) and \(p=0.55\).
load_equipment
fh_commission vs load_equipment: M v A(>2)
The informal part of the analysis for a measured response against an attribute explanatory variable with more than two levels is exactly the same as for the data analysis situation where the explanatory variable has exactly two levels; the interpretation is a little more difficult due to the larger number of levels, but the principles are exactly the same.
To detect a connection/link between the response variable and the explanatory variable requires a clear definition of exactly what a connection/link is.
The formal definition of no link for ‘M vs A’ is:
Conversely:
An examination of the means and the comparative boxplots will yield one of three possible decisions:
If further data analysis is required then it takes the form of a hypothesis test. The data analysis situation Measured v Attribute requires two different hypothesis tests. The situation where the attribute explanatory variable has exactly two levels uses a t-test. A different hypothesis test is required for the situation where the attribute explanatory variable has three or more levels. Formal data analysis for ‘M v A(>2)’ is known as One-Way Analysis of Variance (often abbreviated as one-way ANOVA), for which we use the F-test.
The problem is exactly the same as with the two-level attribute situation: is the difference between the means large enough to suggest that there is a real difference, or could the difference have occurred by pure chance (that is, is it within the limits of sampling error)? If there is no connection then by definition all the true means will be the same, whilst if there is a connection then the means are likely to be different.
Let us obtain the summary of load_equipment:
## Dry Van Flatbed Reefer
## 937 65 10
It appears that data is not balanced as there are only a few observations in Reefer category. Let us look at the boxplots to examine the relationship.
The boxplots for the Flatbed and Dry Van categories overlap, and since there are only a few observations in Reefer, with almost all values around the median, it is hard to draw a clear conclusion about the relationship between the two variables fh_commission and load_equipment. Although one-way ANOVA might not be the most appropriate further analysis considering the ‘shapes’ of the subdistributions, it could still provide us with an insight that could help us in identifying a possible relationship. We will also perform a nonparametric Kruskal-Wallis test.
## Df Sum Sq Mean Sq F value Pr(>F)
## load_equipment 2 1831610 915805 18.53 1.25e-08 ***
## Residuals 1009 49877788 49433
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Kruskal-Wallis rank sum test
##
## data: fh_commission by load_equipment
## Kruskal-Wallis chi-squared = 14.428, df = 2, p-value = 0.0007363
For both tests the \(p\) values are below \(0.1\%\) (\(p = 1.25 \times 10^{-8}\) for the F-test and \(p = 0.0007\) for Kruskal-Wallis), suggesting that on the sample evidence this is a statistically significant relationship.
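Both tests are available in base R. The sketch below runs them on simulated, deliberately unbalanced data; the group sizes, means and spread are invented for illustration and are not the report's data. It also shows that the F value is simply the ratio of the two mean squares, as in the ANOVA table above.

```r
set.seed(42)

# Simulated unbalanced groups (sizes and means are assumptions)
group_sizes <- c("Dry Van" = 60, "Flatbed" = 10, "Reefer" = 5)
group_means <- c(240, 400, 150)

toy <- data.frame(
  load_equipment = factor(rep(names(group_sizes), times = group_sizes)),
  fh_commission  = rnorm(sum(group_sizes),
                         mean = rep(group_means, times = group_sizes),
                         sd   = 200)
)

# One-way ANOVA: F = between-group mean square / residual mean square
anova_table <- summary(aov(fh_commission ~ load_equipment, data = toy))[[1]]
f_value     <- anova_table[["F value"]][1]
mean_sq     <- anova_table[["Mean Sq"]]
all.equal(f_value, mean_sq[1] / mean_sq[2])

# Nonparametric alternative based on ranks
kw <- kruskal.test(fh_commission ~ load_equipment, data = toy)
kw$parameter  # degrees of freedom = number of groups - 1

# The same ratio computed from the report's ANOVA table
915805 / 49433
```

The last line gives roughly 18.53, the F value printed in the table above.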
destination_state
fh_commission vs destination_state: M v A(>2)
## Alabama Arizona Arkansas
## 23 78 76
## California Ciudad de México Coahuila de Zaragoza
## 151 6 14
## Colorado Connecticut Estado de México
## 32 4 8
## Florida Georgia Guanajuato
## 5 18 2
## Idaho Illinois Indiana
## 1 16 17
## Iowa Kansas Kentucky
## 19 11 14
## Louisiana Maryland Massachusetts
## 4 2 2
## Michigan Minnesota Mississippi
## 8 13 6
## Missouri Montana Nebraska
## 6 11 24
## Nevada New Jersey New Mexico
## 4 3 5
## New York North Carolina Nuevo León
## 1 14 8
## Ohio Oklahoma Oregon
## 62 25 5
## Pennsylvania San Luis Potosí South Carolina
## 8 4 16
## Tamaulipas Tennessee Texas
## 8 15 211
## Utah Virginia Washington
## 21 15 5
## West Virginia Wisconsin
## 1 10
There are almost \(50\) different levels for destination_state. Categorical data described through a large number of distinct values poses a serious challenge for regression algorithms, which require numerical inputs. If we decide to use this variable in the predictive model, it would be good to consider using target encoding, also known as impact encoding. This technique is explained in Daniele Micci-Barreca’s paper A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems.
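The idea behind target encoding is to replace each level with a blend of that level's mean response and the overall mean, trusting the level mean more when the level has many observations. The sketch below is a simplified illustration that assumes a shrinkage weight of n / (n + m); the function `target_encode`, the parameter `m` and the toy values are hypothetical, and the weighting function used in the paper differs.

```r
# Hypothetical helper: smoothed target encoding of a categorical variable g
# for a numeric response y. m controls how strongly small groups are pulled
# towards the overall mean (an assumption of this sketch).
target_encode <- function(y, g, m = 1) {
  g           <- as.character(g)
  groups      <- split(y, g)
  global_mean <- mean(y)
  group_mean  <- sapply(groups, mean)
  group_n     <- sapply(groups, length)
  weight      <- group_n / (group_n + m)
  encoded     <- weight * group_mean + (1 - weight) * global_mean
  unname(encoded[g])
}

# Toy data: two states with two loads each, one state with a single load
y <- c(100, 300, 200, 400, 1000)
g <- c("Texas", "Texas", "Ohio", "Ohio", "Idaho")

encoded <- target_encode(y, g, m = 1)
round(encoded, 2)
```

With an overall mean of 400, the single Idaho load is encoded as 700 rather than its raw value of 1000, while the two-observation groups move less. In a predictive setting the encoding should be computed on training data only, otherwise the response leaks into the covariate.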
Let us conduct informal data analysis for fh_commission vs destination_state.
There is a clear difference amongst the groups, but the data is not balanced and not equally spread. For a M vs A(>2) type of problem we will perform a nonparametric Kruskal-Wallis test.
##
## Kruskal-Wallis rank sum test
##
## data: fh_commission by destination_state
## Kruskal-Wallis chi-squared = 253.62, df = 46, p-value < 2.2e-16
The output (p-value < 2.2e-16) suggests that based on the sample evidence this is a statistically significant relationship.
destination_: postal_code; city; country
Considering the type of information these variables provide, it is reasonable to assume that they are highly correlated with each other. As destination_state can be regarded as a conglomerate of the other three, we will check their independence from it using the Chi-squared Test of Independence.
## postal_code:
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 46368, df = 9522, p-value < 2.2e-16
## city:
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 45047, df = 8372, p-value < 2.2e-16
## country:
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 1012, df = 46, p-value < 2.2e-16
The \(p\) values confirm our assumption that these variables are not independent of each other.
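The country test above reports X-squared = 1012, which equals the number of observations. That is what Pearson's statistic gives when one variable is completely determined by the other and has two levels, so it is consistent with every state belonging to exactly one country. The sketch below shows the effect on invented toy data; the state and country values are for illustration only.

```r
# Toy data in which country is fully determined by state
state   <- c("Texas", "Texas", "Texas", "Texas",
             "Ohio", "Ohio", "Ohio",
             "Coahuila", "Coahuila", "Coahuila")
country <- ifelse(state == "Coahuila", "Mexico", "United States")

tbl <- table(state, country)

# Small expected counts trigger an approximation warning, suppressed here
test      <- suppressWarnings(chisq.test(tbl))
x_squared <- unname(test$statistic)

# With perfect dependence: X-squared = n * (min(rows, columns) - 1)
n <- sum(tbl)
c(x_squared = x_squared, n = n, df = unname(test$parameter))
```

In such cases the test says little beyond what is known by construction; the practical consequence is that only one of the nested variables should enter the model.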
origin_state
fh_commission vs origin_state: M v A(>2)
We will conduct the equivalent analysis to the one we did for fh_commission vs destination_state, for obvious reasons.
Yet again, there is a clear difference amongst the groups, but the data is not balanced and not equally spread. We will perform Kruskal-Wallis test.
##
## Kruskal-Wallis rank sum test
##
## data: fh_commission by origin_state
## Kruskal-Wallis chi-squared = 155.37, df = 33, p-value < 2.2e-16
The output (p-value < 2.2e-16) suggests that based on the sample evidence this is a statistically significant relationship.
origin_: postal_code; city; country
To check for independence between origin_state and the three variables above, we will perform the Chi-squared Test of Independence for each.
## postal_code:
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = NaN, df = 2937, p-value = NA
The data is messy: the contingency table clearly has too many zero frequencies in the observed counts, causing the Chi-squared Test to fail (the statistic is NaN).
## city:
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 31175, df = 2607, p-value < 2.2e-16
## country:
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 1012, df = 33, p-value < 2.2e-16
The calculated \(p\) values are confirming our assumption about those variables not being independent from each other.
carrier_name
fh_commission vs carrier_name: M v A(>2)
Let’s see how many levels this variable has.
## [1] 309
\(309\) different categories are too many for plotting on a boxplot and this is a variable that would have to be encoded for its application in the predictive model.
status
fh_commission vs status: M v A(>2)
## at_stop completed go_to_pu in_transit on_do pending
## 1 960 9 13 1 5
## unloaded
## 23
Is there a point in considering this variable in relation to fh_commission, as the pricing would have been done before the status could be obtained?
is_multistop
fh_commission vs is_multistop: M v A(2)
## Mode FALSE TRUE
## logical 988 24
What does this variable represent? Could it influence fh_commission? Let us see how many have crossed the border:
is_cross_border
fh_commission vs is_cross_border: M v A(2)
## Mode FALSE TRUE
## logical 988 24
Should we expect this variable to be directly linked with destination_country?
## Mexico United States
## 50 962
If not, as the data suggests, why not?
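One way to investigate the mismatch is to compare the recorded flag with the flag implied by the origin and destination countries, since a load that both starts and ends in Mexico never crosses the border. The sketch below does this on an invented toy data frame; the five rows are assumptions for illustration, and only the column names come from the data set.

```r
# Toy shipments (values invented for illustration)
toy <- data.frame(
  origin_country      = c("United States", "United States", "Mexico",
                          "United States", "Mexico"),
  destination_country = c("United States", "Mexico", "United States",
                          "Mexico", "Mexico"),
  is_cross_border     = c(FALSE, TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors    = FALSE
)

# Flag implied by the two country columns
expected_cross <- toy$origin_country != toy$destination_country

# Cross-tabulate implied against recorded flags
table(expected = expected_cross, recorded = toy$is_cross_border)

# Rows where the two disagree and which deserve a closer look
mismatch <- which(expected_cross != toy$is_cross_border)
mismatch
```

In the toy data only the fourth shipment disagrees. Applied to mydata, the same comparison would show how many of the discrepancies are explained by shipments within Mexico and how many remain.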
is_completed
fh_commission vs is_completed: M v A(2)
This information is already provided through the variable status.
## Mode FALSE TRUE
## logical 52 960
## status:
## at_stop completed go_to_pu in_transit on_do pending
## 1 960 9 13 1 5
## unloaded
## 23
Should we expect those variables to have influence on fh_commission?
is_dropped
fh_commission vs is_dropped: M v A(2)
## Mode FALSE TRUE
## logical 910 102
Same question as above: Is it reasonable to consider this variable as an explanatory variable of fh_commission?
shipper_name
fh_commission vs shipper_name: M v A(>2)
This is an attribute variable with over \(40\) levels:
## [1] 42
which is unbalanced: Nonetheless, let’s perform Kruskal-Wallis:
##
## Kruskal-Wallis rank sum test
##
## data: fh_commission by shipper_name
## Kruskal-Wallis chi-squared = 393.69, df = 41, p-value < 2.2e-16
The \(p\) value suggests this to be a significant relationship.
diffhourS
fh_commission vs diffhourS: M v M
Remember that we have derived diffhourS by calculating:
\(diffhourS = delivery\_scheduled\_until - delivery\_scheduled\_at\).
We will start by observing the spread of the diffhourS
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.000 4.474 8.000 108.000
## [1] 135 244 349
and look at the spread of the data once again without the observations above this threshold.
Let us fit a regression model for fh_commission vs diffhourS.
##
## Call:
## lm(formula = fh_commission ~ diffhourS, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2000.99 -150.99 -48.24 142.81 1103.73
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 250.991 8.404 29.867 <2e-16 ***
## diffhourS -0.786 1.001 -0.785 0.433
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 226.2 on 1010 degrees of freedom
## Multiple R-squared: 0.00061, Adjusted R-squared: -0.0003795
## F-statistic: 0.6164 on 1 and 1010 DF, p-value: 0.4326
Even if we removed the extreme values, the line of best fit would still be flat, suggesting that there is no significant relationship, as is confirmed by the large \(p\) value of \(0.4326\).
diffhourSP
fh_commission vs diffhourSP: M v M
Remember: \(diffhourSP = delivery\_scheduled\_at - pickup\_scheduled\_at\)
Let us look at the spread of the variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 41.00 50.00 58.84 73.56 325.00
There are a few ‘large’ observations:
## [1] 46 204 439 454 519 819 820 821
It doesn’t match any observation we picked up earlier when observing the other variables.
Let us fit a regression model for fh_commission vs diffhourSP.
##
## Call:
## lm(formula = fh_commission ~ diffhourSP, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1916.03 -127.78 -34.19 115.81 1087.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140.5975 13.9114 10.107 <2e-16 ***
## diffhourSP 1.8163 0.2057 8.828 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 218 on 1010 degrees of freedom
## Multiple R-squared: 0.07164, Adjusted R-squared: 0.07072
## F-statistic: 77.94 on 1 and 1010 DF, p-value: < 2.2e-16
This is a statistically significant relationship in which \(7.16\%\) of the variability in fh_commission is explained by diffhourSP.
diffhourMP
fh_commission vs diffhourMP: M v M
\(diffhourMP = matched\_at - posted\_at\)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0056 0.0938 0.7417 11.5290 6.2753 393.8819
This is a right-skewed distribution with a long tail to the right. We will not try to identify the observations in the right tail as there are too many. We will go on to fit a regression model.
##
## Call:
## lm(formula = fh_commission ~ diffhourMP, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1995.34 -151.75 -51.42 139.79 1097.73
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 252.3457 7.5680 33.34 <2e-16 ***
## diffhourMP -0.4225 0.2271 -1.86 0.0631 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 225.9 on 1010 degrees of freedom
## Multiple R-squared: 0.003415, Adjusted R-squared: 0.002428
## F-statistic: 3.461 on 1 and 1010 DF, p-value: 0.06312
This relationship is weak and only marginally significant: it is significant at the \(10\%\) level but not at the \(5\%\) level, since \(0.05 < p < 0.1\) (\(p = 0.063\)).
We ultimately want to build a model:
\[ y = b_0 + b_1x_1 + b_2x_2 + \dots + b_px_p\] where \(y\) is a response variable of interest and the \(x_i\)’s (where \(i=1, 2, \dots, p\)) are covariates, i.e. explanatory variables. We have already recognised that our explanatory variables are not independent, and we should expect them to have a joint effect on some part of \(y\). There will be a relationship between \(x_i\) and \(y\) that can’t be distinguished from the relationship between \(x_j\) and \(y\), as \(x_i\) and \(x_j\) are not independent. The estimated relationship between \(x_i\) and \(y\) is then no longer the full effect of \(x_i\) on \(y\). It is the marginal, unique effect of \(x_i\) on \(y\), after controlling for the effect of \(x_j\).
While the model fit as a whole will include both the joint and the unique effects of all \(x_i\)’s on \(y\), the regression coefficient for individual \(x_i\) will only include its unique effect.
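The practical cost of such overlap can be quantified with the variance inflation factor (VIF). The sketch below works through the arithmetic for the hypothetical case in which distance and duration were the only two covariates, using the correlation reported earlier; in that case the VIF of either variable is \(1/(1-r^2)\).

```r
# Correlation between distance and duration reported earlier in the report
r <- 0.9894508

# With only two covariates, the R^2 from regressing one on the other is r^2
r_squared <- r^2

# Variance inflation factor: how much the variance of a coefficient estimate
# grows because of the overlap with the other covariate
vif <- 1 / (1 - r_squared)
vif        # roughly 47.6

# The corresponding inflation of the standard error
sqrt(vif)  # roughly 6.9
```

An inflation of this size suggests keeping only one of the two variables, since very little unique information is left for the second.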
We will start building the model using the available explanatory variables, but before we accept this as the best fitted model, we need to seek answers to the following questions:
If the answer to the above question(s) is YES, next we need to assess:
fh_commission is the key variable of interest: \[fh\_commission = shipper\_closed\_price - carrier\_closed\_price\]
Hence, the two variables shipper_closed_price and carrier_closed_price are directly related to the response variable fh_commission. As such, we will not consider them as covariates in the model.
fh_commission has a few extreme observations, one of which (observation no. \(203\)) has a very low negative value of \(-1750.00\). Should we expect this to happen?
Three new variables have been created (diffhourS, diffhourSP and diffhourMP), which calculate varying time differences in hours.
Is there any other time difference that should be observed and does it make sense for them to be used in the analysis?
load_description is a very messy variable!!! 😣 This variable needs some considerable work spent on it to make it suitable for modelling. This could be done using regular expressions to ‘standardise’ the given categories.
fh_commission are found from the following variables:
shipper_ask_price: weak, but nonetheless statistically significant with \(R^2=1.36\%\);
distance: statistically significant relationship with \(R^2=11.91\%\);
duration: statistically significant relationship with \(R^2=10.32\%\);
diffhourSP: statistically significant relationship with \(R^2=7.16\%\);
diffhourMP: weak relationship, significant only at the \(10\%\) level (\(p=0.063\)), with \(R^2=0.34\%\).
scheduled_at: month: with a clear dip during the summer months;
delivery_scheduled_at: month: with a clear dip during the summer months;
pickup_scheduled_at: month: with a clear dip during the summer months;
load_equipment: \(p < 0.001\);
destination_state: \(p < 2.2 \times 10^{-16}\);
destination_: postal_code; city; country: all have \(p < 2.2 \times 10^{-16}\);
origin_state: \(p < 2.2 \times 10^{-16}\);
origin_: postal_code; city; country: postal_code is very unbalanced and we could not perform a statistical test; the other two have \(p < 2.2 \times 10^{-16}\);
shipper_name: \(p < 2.2 \times 10^{-16}\);
is_hazmat has only a few observations in one of its two categories, making it very unbalanced for statistical modelling;
carrier_name has over \(300\) levels, i.e. categories;
status: the pricing would have been done before the status could be obtained?
is_multistop: same information as is_cross_border
is_cross_border: should it be linked to destination_country?
is_completed: after event to the ‘pricing’?!;
is_dropped: ‘after event’?!
max_bid: \(93.77\%\)
min_bid: \(93.77\%\)
no_bids_refused: \(93.77\%\)
unloaded_at: \(42.59\%\)
carrier_winning_bid: \(98.42\%\)
We would need to think carefully about the best way of approaching them.
primary_tracking_source: has only one possible outcome
id: same as system index number
shipper_id: same as shipper_name
shipment_no: same as system index number
destination and origin _postal_code provide the same info as destination and origin state.
destination_country and is_cross_border are not showing the same information?
##
## Mexico United States
## 50 962
##
## FALSE TRUE
## 988 24
distance calculated?
Calculating shipping distances using the Google API: https://developers.google.com/maps/documentation/distance-matrix/get-api-key
# google api for calculating google maps distances
library(gmapsdistance)
# test <- gmapsdistance(origin = from,
# destination = to,
# combinations = "pairwise",
# key = "YOURAPIKEYHERE",
# mode = "walking")
dp <- paste0('"', destination_point_lat, '+', destination_point_long, '"')
op <- paste0('"', origin_point_lat, '+', origin_point_long, '"')
results = gmapsdistance(origin = op,
                        destination = dp,
                        mode = "driving")
# convert results$Distance from m into miles
udunits2::ud.convert(results$Distance, "m", "mi")
#
# ------------ or with:
library(googleway)
#
# test <- google_distance(origins = from,
# destinations = to,
# mode = "walking",
# key="YOURAPIKEYHERE")
# -----------------
#
# Interesting to see (for example the 1st observation)
results = gmapsdistance(origin = "27.617417+-99.523012",
                        destination = "38.8645105+-76.7279378",
                        mode = "driving")
results$Distance <- udunits2::ud.convert(results$Distance, "m", "mi")
distance[1]
[1] 1780.323
duration[1]
[1] 97020
results
$Time
[1] 93651
# There is some small discrepancy
$Distance
[1] 1774.97
$Status
[1] "OK"
Who performs these calculations (distance and duration), and when is this information collected?
the response variable:
the explanatory variables:
Thus, we have:
\[ y = b_0 +b_1x_1 + b_2x_2 + ... + b_{18}x_{18} \]
The majority of the variables are of attribute type and, as we have seen earlier in the report when observing them individually and in relation to fh_commission, many of them are correlated, providing the same type of information.
Another point that we need to consider is that many variables are high-cardinality categorical attribute variables, which will need to be encoded when used in the model fitting procedure.
We will fit a model for illustrative purposes, as we are not sure whether the variable fh_commission is the key variable of interest, and some of the variables we wish to consider for our model are messy and correlated with each other. Hence, we will simplify the set of explanatory variables we wish to use.
This is the data we will use for our model:
## Observations: 1,012
## Variables: 19
## $ fh_commission <dbl> 875.00, 625.00, 1075.00, 600.00...
## $ shipper_ask_price <dbl> 4375, 3125, 5375, 4000, 2625, 1...
## $ distance <dbl> 1780.3230, 1517.9297, 2177.3311...
## $ duration <int> 97020, 95100, 136440, 102360, 8...
## $ delivery_scheduled_at_month <fct> 6, 7, 7, 7, 7, 8, 8, 8, 8, 10, ...
## $ load_description <fct> CONSUMERGOODS, PIPES, BUILDINGM...
## $ load_equipment <fct> Dry Van, Dry Van, Flatbed, Flat...
## $ destination_state <fct> Maryland, Ohio, Washington, Ohi...
## $ destination_city <fct> Upper Marlboro, Findlay, Spokan...
## $ destination_country <fct> United States, United States, U...
## $ origin_state <fct> Texas, Texas, Texas, Texas, Ill...
## $ origin_city <fct> Laredo, Laredo, Laredo, Laredo,...
## $ origin_country <fct> United States, United States, U...
## $ shipper_name <fct> Ventus Freight LLC, Ventus Frei...
## $ diffhourSP <dbl> 66.0, 66.0, 92.0, 116.0, 90.0, ...
## $ diffhourMP <dbl> 0.100555556, 0.047777778, 0.035...
## $ cross_border <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ delivery_scheduled_until_month <fct> 6, 7, 7, 7, 7, 8, 8, 8, 8, 10, ...
## $ pickup_scheduled_at_month <fct> 6, 7, 7, 7, 6, 8, 8, 8, 8, 10, ...
Remember, there are a number of “messy” variables, which we will ignore for the purpose of showing the modelling procedure and illustrating the need to tidy them up.
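The trace that follows has the layout printed by stepwise selection on AIC. As a hedged sketch of how such a selection can be run, the snippet below uses `stepAIC()` from the MASS package on the built-in `mtcars` data, which stands in for the report's `md` data frame; the formula and variables here are assumptions for illustration only.

```r
library(MASS)  # provides stepAIC()

# Stand-in data: mtcars replaces the report's `md` data frame.
full <- glm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)

# Starting from the full model, drop or re-add one term at a time
# for as long as the AIC keeps decreasing.
sel <- stepAIC(full, direction = "both", trace = FALSE)

sel$anova     # the step-by-step path with deviance and AIC per step
formula(sel)  # the formula of the selected model
```

Setting `trace = TRUE` (the default) prints the candidate table at every step, in the same form as the output shown here.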
## Start: AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month +
## load_description + load_equipment + destination_state + destination_city +
## destination_country + origin_state + origin_city + origin_country +
## shipper_name + diffhourSP + diffhourMP + cross_border + delivery_scheduled_until_month +
## pickup_scheduled_at_month
##
##
## Step: AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month +
## load_description + load_equipment + destination_state + destination_city +
## destination_country + origin_state + origin_city + origin_country +
## shipper_name + diffhourSP + diffhourMP + cross_border + pickup_scheduled_at_month
##
##
## Step: AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month +
## load_description + load_equipment + destination_state + destination_city +
## destination_country + origin_state + origin_city + origin_country +
## shipper_name + diffhourSP + diffhourMP + pickup_scheduled_at_month
##
##
## Step: AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month +
## load_description + load_equipment + destination_state + destination_city +
## destination_country + origin_state + origin_city + shipper_name +
## diffhourSP + diffhourMP + pickup_scheduled_at_month
##
##
## Step: AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month +
## load_description + load_equipment + destination_state + destination_city +
## destination_country + origin_state + shipper_name + diffhourSP +
## diffhourMP + pickup_scheduled_at_month
##
##
## Step: AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month +
## load_description + load_equipment + destination_state + destination_city +
## destination_country + shipper_name + diffhourSP + diffhourMP +
## pickup_scheduled_at_month
##
##
## Step: AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month +
## load_description + load_equipment + destination_state + destination_city +
## shipper_name + diffhourSP + diffhourMP + pickup_scheduled_at_month
##
##
## Step: AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month +
## load_description + load_equipment + destination_city + shipper_name +
## diffhourSP + diffhourMP + pickup_scheduled_at_month
##
##
## Step: AIC=2578.69
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month +
## load_description + destination_city + shipper_name + diffhourSP +
## diffhourMP + pickup_scheduled_at_month
##
## Df Deviance AIC
## - delivery_scheduled_at_month 5 1077521 2571.2
## - pickup_scheduled_at_month 4 1076738 2573.1
## - diffhourSP 1 1064695 2576.8
## - shipper_name 2 1080064 2577.7
## - diffhourMP 1 1071350 2578.1
## <none> 1064200 2578.7
## - distance 1 1076778 2579.1
## - shipper_ask_price 1 1085898 2580.8
## - duration 1 1091566 2581.8
## - load_description 13 1308824 2594.7
## - destination_city 46 2881812 2688.9
##
## Step: AIC=2571.21
## fh_commission ~ shipper_ask_price + distance + duration + load_description +
## destination_city + shipper_name + diffhourSP + diffhourMP +
## pickup_scheduled_at_month
##
## Df Deviance AIC
## - diffhourSP 1 1077521 2569.2
## - shipper_name 2 1091433 2569.8
## - diffhourMP 1 1083235 2570.3
## - pickup_scheduled_at_month 9 1174739 2570.8
## - distance 1 1087063 2571.0
## <none> 1077521 2571.2
## - shipper_ask_price 1 1099394 2573.3
## - duration 1 1108184 2574.9
## + delivery_scheduled_at_month 5 1064200 2578.7
## + delivery_scheduled_until_month 5 1064200 2578.7
## - load_description 13 1331715 2588.2
## - destination_city 46 3028979 2689.0
##
## Step: AIC=2569.21
## fh_commission ~ shipper_ask_price + distance + duration + load_description +
## destination_city + shipper_name + diffhourMP + pickup_scheduled_at_month
##
## Df Deviance AIC
## - shipper_name 2 1091628 2567.8
## - diffhourMP 1 1083347 2568.3
## - distance 1 1087073 2569.0
## - pickup_scheduled_at_month 9 1176950 2569.1
## <none> 1077521 2569.2
## + diffhourSP 1 1077521 2571.2
## - shipper_ask_price 1 1099399 2571.3
## - duration 1 1108239 2572.9
## + delivery_scheduled_at_month 5 1064695 2576.8
## + delivery_scheduled_until_month 5 1064695 2576.8
## - load_description 13 1334018 2586.6
## - destination_city 46 3074487 2690.1
##
## Step: AIC=2567.85
## fh_commission ~ shipper_ask_price + distance + duration + load_description +
## destination_city + diffhourMP + pickup_scheduled_at_month
##
## Df Deviance AIC
## - diffhourMP 1 1096958 2566.8
## - distance 1 1098254 2567.1
## <none> 1091628 2567.8
## + destination_state 1 1083803 2568.4
## - pickup_scheduled_at_month 9 1199281 2568.9
## + shipper_name 2 1077521 2569.2
## - shipper_ask_price 1 1112865 2569.8
## + diffhourSP 1 1091433 2569.8
## - duration 1 1124328 2571.8
## + delivery_scheduled_at_month 5 1080068 2575.7
## + delivery_scheduled_until_month 5 1080068 2575.7
## - load_description 30 2247425 2654.4
## - destination_city 52 3216467 2683.2
##
## Step: AIC=2566.84
## fh_commission ~ shipper_ask_price + distance + duration + load_description +
## destination_city + pickup_scheduled_at_month
##
## Df Deviance AIC
## - distance 1 1103429 2566.0
## <none> 1096958 2566.8
## + destination_state 1 1089297 2567.4
## + diffhourMP 1 1091628 2567.8
## + shipper_name 2 1083347 2568.3
## + diffhourSP 1 1096379 2568.7
## - shipper_ask_price 1 1118286 2568.8
## - pickup_scheduled_at_month 9 1217665 2570.0
## - duration 1 1129149 2570.7
## + delivery_scheduled_at_month 5 1086270 2574.9
## + delivery_scheduled_until_month 5 1086270 2574.9
## - load_description 30 2250577 2652.7
## - destination_city 52 3230649 2682.1
##
## Step: AIC=2566.04
## fh_commission ~ shipper_ask_price + duration + load_description +
## destination_city + pickup_scheduled_at_month
##
## Df Deviance AIC
## <none> 1103429 2566.0
## + distance 1 1096958 2566.8
## + destination_state 1 1097764 2567.0
## + diffhourMP 1 1098254 2567.1
## - shipper_ask_price 1 1123728 2567.7
## + diffhourSP 1 1103000 2568.0
## + shipper_name 2 1092657 2568.1
## - pickup_scheduled_at_month 9 1233113 2570.6
## - duration 1 1150892 2572.6
## + delivery_scheduled_at_month 5 1095296 2574.5
## + delivery_scheduled_until_month 5 1095296 2574.5
## - load_description 30 2255878 2651.2
## - destination_city 52 3235810 2680.4
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## fh_commission ~ shipper_ask_price + distance + duration + delivery_scheduled_at_month +
## load_description + load_equipment + destination_state + destination_city +
## destination_country + origin_state + origin_city + origin_country +
## shipper_name + diffhourSP + diffhourMP + cross_border + delivery_scheduled_until_month +
## pickup_scheduled_at_month
##
## Final Model:
## fh_commission ~ shipper_ask_price + duration + load_description +
## destination_city + pickup_scheduled_at_month
##
##
## Step Df Deviance Resid. Df Resid. Dev
## 1 72 1064200
## 2 - delivery_scheduled_until_month 0 0.000000e+00 72 1064200
## 3 - cross_border 0 0.000000e+00 72 1064200
## 4 - origin_country 0 0.000000e+00 72 1064200
## 5 - origin_city 0 0.000000e+00 72 1064200
## 6 - origin_state 0 0.000000e+00 72 1064200
## 7 - destination_country 0 0.000000e+00 72 1064200
## 8 - destination_state 0 0.000000e+00 72 1064200
## 9 - load_equipment 0 2.328306e-10 72 1064200
## 10 - delivery_scheduled_at_month 5 1.332102e+04 77 1077521
## 11 - diffhourSP 1 3.237737e-01 78 1077521
## 12 - shipper_name 2 1.410728e+04 80 1091628
## 13 - diffhourMP 1 5.329806e+03 81 1096958
## 14 - distance 1 6.470938e+03 82 1103429
## AIC
## 1 2578.688
## 2 2578.688
## 3 2578.688
## 4 2578.688
## 5 2578.688
## 6 2578.688
## 7 2578.688
## 8 2578.688
## 9 2578.688
## 10 2571.213
## 11 2569.213
## 12 2567.854
## 13 2566.843
## 14 2566.037
##
## Call:
## glm(formula = fh_commission ~ shipper_ask_price + duration +
## load_description + destination_city + pickup_scheduled_at_month,
## data = md[-train, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -212.76 -10.85 0.00 15.27 463.71
##
## Coefficients: (26 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.463e+02 4.386e+02 -1.473 0.144474
## shipper_ask_price 5.922e-03 4.821e-03 1.228 0.222882
## duration 1.839e-02 9.791e-03 1.878 0.063926 .
## load_description1 -2.152e+02 2.055e+02 -1.047 0.298115
## load_description2 -4.293e+01 2.074e+02 -0.207 0.836542
## load_description3 -7.479e+02 3.364e+02 -2.223 0.028939 *
## load_description4 -7.730e+02 4.760e+02 -1.624 0.108266
## load_description5 3.259e+02 3.239e+02 1.006 0.317287
## load_description6 -7.126e+02 4.253e+02 -1.675 0.097646 .
## load_description7 -8.718e+02 4.629e+02 -1.883 0.063190 .
## load_description8 -6.280e+02 2.769e+02 -2.268 0.025955 *
## load_description9 4.473e+02 3.127e+02 1.430 0.156464
## load_description10 2.753e+02 4.408e+02 0.625 0.533966
## load_description11 -8.336e+02 4.501e+02 -1.852 0.067600 .
## load_description12 -1.442e+03 1.060e+03 -1.360 0.177428
## load_description13 -1.370e+03 4.564e+02 -3.002 0.003552 **
## load_description14 4.840e+01 5.949e+02 0.081 0.935361
## load_description15 -9.477e+02 4.297e+02 -2.205 0.030232 *
## load_description16 -9.477e+02 4.297e+02 -2.205 0.030232 *
## load_description17 2.213e+02 5.455e+02 0.406 0.685991
## load_description18 -2.099e+02 3.507e+02 -0.598 0.551158
## load_description19 4.854e+02 2.712e+02 1.790 0.077170 .
## load_description20 1.905e+02 1.633e+02 1.167 0.246729
## load_description21 -1.174e+03 5.519e+02 -2.128 0.036378 *
## load_description22 -7.336e+02 3.802e+02 -1.930 0.057115 .
## load_description23 3.898e+02 2.352e+02 1.658 0.101197
## load_description24 7.249e+02 5.569e+02 1.302 0.196690
## load_description25 -6.414e+02 2.643e+02 -2.427 0.017434 *
## load_description26 3.218e+02 2.493e+02 1.291 0.200435
## load_description27 -9.828e+02 4.665e+02 -2.107 0.038172 *
## load_description28 -3.755e+02 5.062e+02 -0.742 0.460266
## load_description29 -1.691e+01 1.353e+02 -0.125 0.900834
## load_description30 8.835e+02 3.685e+02 2.397 0.018785 *
## load_description31 -4.435e+02 6.799e+02 -0.652 0.516019
## load_description32 3.003e+02 4.228e+02 0.710 0.479459
## load_description33 1.444e+03 8.769e+02 1.647 0.103488
## load_description34 6.166e+01 5.800e+02 0.106 0.915590
## load_description35 -7.019e+02 4.987e+02 -1.408 0.163044
## load_description36 -2.105e+02 3.015e+02 -0.698 0.487074
## load_description37 -1.036e+03 3.987e+02 -2.599 0.011081 *
## load_description38 3.641e+02 4.361e+02 0.835 0.406151
## load_description39 -5.161e+02 4.269e+02 -1.209 0.230157
## load_description40 -1.172e+03 6.500e+02 -1.804 0.074936 .
## load_description41 -9.583e+02 6.770e+02 -1.415 0.160712
## load_description42 -1.041e+03 5.471e+02 -1.903 0.060512 .
## load_description43 -1.945e+03 1.212e+03 -1.604 0.112535
## load_description44 -8.739e+02 5.495e+02 -1.590 0.115616
## load_description45 -1.232e+03 5.541e+02 -2.223 0.028939 *
## load_description46 -3.171e+02 1.885e+02 -1.682 0.096343 .
## load_description47 4.324e+02 3.481e+02 1.242 0.217629
## load_description48 4.184e+02 3.395e+02 1.232 0.221384
## load_description49 -7.212e+02 2.110e+02 -3.418 0.000986 ***
## load_description50 1.896e+04 1.115e+04 1.700 0.092963 .
## load_description51 3.686e+02 4.197e+02 0.878 0.382410
## load_description52 1.440e+03 8.692e+02 1.656 0.101496
## load_description53 1.474e+02 3.252e+02 0.453 0.651468
## load_description54 2.280e+02 3.275e+02 0.696 0.488307
## load_description55 -1.311e+03 6.567e+02 -1.996 0.049223 *
## load_description56 -1.583e+03 6.579e+02 -2.406 0.018354 *
## destination_city1 7.716e+02 4.276e+02 1.804 0.074830 .
## destination_city2 4.613e+02 3.054e+02 1.511 0.134728
## destination_city3 1.274e+03 6.749e+02 1.888 0.062629 .
## destination_city4 6.039e+02 3.477e+02 1.737 0.086146 .
## destination_city5 1.752e+03 1.322e+03 1.325 0.188860
## destination_city6 -1.914e+04 1.108e+04 -1.727 0.087870 .
## destination_city7 5.035e+02 2.514e+02 2.003 0.048495 *
## destination_city8 NA NA NA NA
## destination_city9 -1.989e+02 2.435e+02 -0.817 0.416469
## destination_city10 NA NA NA NA
## destination_city11 1.261e+03 4.617e+02 2.732 0.007711 **
## destination_city12 -5.866e+02 4.392e+02 -1.335 0.185434
## destination_city13 -1.437e+02 2.306e+02 -0.623 0.535056
## destination_city14 NA NA NA NA
## destination_city15 NA NA NA NA
## destination_city16 -5.440e+02 3.563e+02 -1.527 0.130699
## destination_city17 4.266e+02 2.312e+02 1.845 0.068634 .
## destination_city18 -1.101e+02 1.594e+02 -0.691 0.491603
## destination_city19 -7.488e+02 6.461e+02 -1.159 0.249785
## destination_city20 2.074e+02 2.316e+02 0.895 0.373223
## destination_city21 6.473e+02 3.726e+02 1.737 0.086063 .
## destination_city22 6.525e+02 3.682e+02 1.772 0.080101 .
## destination_city23 1.658e+02 1.604e+02 1.034 0.304205
## destination_city24 NA NA NA NA
## destination_city25 NA NA NA NA
## destination_city26 -1.413e+03 8.151e+02 -1.734 0.086744 .
## destination_city27 5.492e+02 2.998e+02 1.832 0.070616 .
## destination_city28 NA NA NA NA
## destination_city29 1.883e+02 2.511e+02 0.750 0.455403
## destination_city30 NA NA NA NA
## destination_city31 1.002e+03 5.719e+02 1.752 0.083514 .
## destination_city32 2.544e+02 1.656e+02 1.537 0.128254
## destination_city33 1.426e+03 8.453e+02 1.687 0.095390 .
## destination_city34 1.631e+03 1.264e+03 1.290 0.200578
## destination_city35 3.914e+02 1.965e+02 1.992 0.049711 *
## destination_city36 3.948e+02 7.518e+02 0.525 0.600921
## destination_city37 1.484e+02 2.389e+02 0.621 0.536132
## destination_city38 9.378e+02 4.988e+02 1.880 0.063665 .
## destination_city39 NA NA NA NA
## destination_city40 3.984e+02 2.771e+02 1.438 0.154373
## destination_city41 1.364e+02 1.791e+02 0.761 0.448609
## destination_city42 -2.786e+01 1.912e+02 -0.146 0.884491
## destination_city43 NA NA NA NA
## destination_city44 NA NA NA NA
## destination_city45 NA NA NA NA
## destination_city46 6.856e+02 3.919e+02 1.749 0.083998 .
## destination_city47 1.111e+02 1.505e+02 0.738 0.462561
## destination_city48 9.588e+02 5.605e+02 1.710 0.090961 .
## destination_city49 1.497e+02 1.887e+02 0.793 0.430007
## destination_city50 2.349e+02 1.414e+02 1.662 0.100344
## destination_city51 NA NA NA NA
## destination_city52 NA NA NA NA
## destination_city53 3.098e+02 1.919e+02 1.614 0.110356
## destination_city54 8.079e+02 3.962e+02 2.039 0.044667 *
## destination_city55 NA NA NA NA
## destination_city56 1.583e+03 8.606e+02 1.840 0.069425 .
## destination_city57 9.667e+02 4.326e+02 2.235 0.028146 *
## destination_city58 -2.899e+02 5.729e+02 -0.506 0.614243
## destination_city59 NA NA NA NA
## destination_city60 2.001e+02 2.442e+02 0.819 0.415078
## destination_city61 NA NA NA NA
## destination_city62 3.298e+02 2.182e+02 1.512 0.134450
## destination_city63 5.195e+01 6.292e+02 0.083 0.934397
## destination_city64 -3.925e+02 3.099e+02 -1.267 0.208897
## destination_city65 -9.606e+02 4.848e+02 -1.982 0.050876 .
## destination_city66 NA NA NA NA
## destination_city67 -1.592e+02 1.978e+02 -0.805 0.423311
## destination_city68 -6.315e+01 1.696e+02 -0.372 0.710607
## destination_city69 NA NA NA NA
## destination_city70 1.247e+03 6.424e+02 1.941 0.055643 .
## destination_city71 1.604e+02 1.621e+02 0.989 0.325496
## destination_city72 NA NA NA NA
## destination_city73 5.890e+02 3.334e+02 1.767 0.081006 .
## destination_city74 NA NA NA NA
## destination_city75 NA NA NA NA
## destination_city76 NA NA NA NA
## destination_city77 NA NA NA NA
## pickup_scheduled_at_month1 1.204e+02 5.513e+01 2.183 0.031863 *
## pickup_scheduled_at_month2 4.330e+01 6.260e+01 0.692 0.491046
## pickup_scheduled_at_month3 4.358e+01 7.160e+01 0.609 0.544413
## pickup_scheduled_at_month4 7.580e+01 7.356e+01 1.031 0.305775
## pickup_scheduled_at_month5 1.611e+01 8.058e+01 0.200 0.842031
## pickup_scheduled_at_month6 -3.642e+02 3.864e+02 -0.943 0.348603
## pickup_scheduled_at_month7 NA NA NA NA
## pickup_scheduled_at_month8 2.731e+01 7.715e+01 0.354 0.724311
## pickup_scheduled_at_month9 1.165e+01 5.842e+01 0.199 0.842413
## pickup_scheduled_at_month10 -2.592e+01 5.088e+01 -0.509 0.611772
## pickup_scheduled_at_month11 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 13456.45)
##
## Null deviance: 9110366 on 202 degrees of freedom
## Residual deviance: 1103429 on 82 degrees of freedom
## AIC: 2566
##
## Number of Fisher Scoring iterations: 2
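The dispersion and AIC reported in the summary above can be reproduced by hand from the deviances and degrees of freedom. The sketch below assumes the Gaussian-family definition of AIC used by `glm()`, namely n(log(2πD/n) + 1) + 2(p + 1), where D is the residual deviance, n the number of observations and p the number of estimated coefficients; the variable names are ours.

```r
# Values read off the summary output.
n        <- 202 + 1        # the null deviance has n - 1 = 202 degrees of freedom
resid_df <- 82
dev      <- 1103429        # residual deviance
p        <- n - resid_df   # number of estimated coefficients

# Gaussian dispersion: residual deviance per residual degree of freedom.
dispersion <- dev / resid_df

# Gaussian AIC; the extra "+ 1" counts the estimated variance.
aic <- n * (log(2 * pi * dev / n) + 1) + 2 * (p + 1)

round(dispersion, 2)
round(aic, 3)
```

Both values agree with the summary up to rounding. Note that the model estimates 121 coefficients from only 203 observations, which is why the residual degrees of freedom are so low.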
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## round.pred..digits...2. round.md.fh_commission.test...digits...2.
## 11 100.00 100.00
## 16 544.58 400.00
## 20 536.29 1000.00
## 28 186.06 150.00
## 29 132.03 300.00
## 35 277.21 300.00
## 47 700.00 700.00
## 59 100.00 100.00
## 70 223.97 200.00
## 74 257.35 250.00
## 76 50.00 50.00
## 77 567.19 600.00
## 80 269.48 250.00
## 85 195.34 200.00
## 90 157.96 200.00
## 92 393.30 600.00
## 97 114.36 0.00
## 99 334.15 200.00
## 107 200.00 200.00
## 109 620.94 700.01
## 113 253.66 150.00
## 114 41.42 28.00
## 129 233.16 388.50
## 170 200.00 200.00
## 183 -300.00 -300.00
## 199 180.99 200.00
## 206 250.00 250.00
## 210 61.11 50.00
## 217 1030.00 1030.00
## 229 250.30 300.00
## 238 556.16 600.00
## 252 144.20 100.00
## 263 249.70 200.00
## 266 450.00 500.00
## 268 528.85 700.00
## 278 245.77 50.00
## 280 544.58 425.00
## 282 254.59 250.00
## 286 200.00 200.00
## 287 150.00 150.00
## 289 -143.00 -143.00
## 296 185.96 200.00
## 300 250.00 250.00
## 301 253.66 500.00
## 308 273.00 273.00
## 314 212.50 212.50
## 316 550.00 550.00
## 318 186.00 186.00
## 320 106.50 106.50
## 323 586.56 600.00
## 330 281.13 300.00
## 344 50.00 50.00
## 346 200.00 200.00
## 360 395.00 395.00
## 361 72.02 100.00
## 363 412.76 200.00
## 378 25.00 25.00
## 383 309.06 309.06
## 385 368.54 161.84
## 388 553.40 400.01
## 395 133.64 50.00
## 396 250.00 250.00
## 401 300.00 300.00
## 405 240.00 240.00
## 409 183.50 200.00
## 419 542.37 600.00
## 442 210.01 0.00
## 455 -100.00 -100.00
## 456 450.00 400.00
## 457 50.00 50.00
## 459 100.00 100.00
## 460 379.19 400.00
## 461 379.19 500.00
## 462 300.01 200.00
## 465 153.56 150.00
## 470 42.80 100.00
## 471 -100.00 -100.00
## 475 550.00 550.00
## 480 609.00 609.00
## 491 207.48 200.00
## 495 390.17 250.00
## 497 254.60 300.00
## 498 420.47 350.00
## 505 528.85 700.00
## 513 112.04 100.00
## 520 198.70 100.00
## 523 565.00 565.00
## 525 37.58 100.00
## 527 500.00 500.00
## 529 205.00 150.00
## 530 343.55 343.55
## 540 190.49 300.00
## 541 0.00 0.00
## 543 95.00 95.00
## 545 95.00 95.00
## 550 386.20 400.00
## 558 679.85 650.00
## 562 700.00 700.00
## 564 262.59 300.00
## 565 157.96 200.00
## 572 157.96 0.00
## 575 300.00 300.00
## 580 233.05 200.00
## 584 253.66 100.00
## 585 300.00 300.00
## 590 253.91 200.00
## 601 300.00 300.00
## 605 233.06 250.00
## 606 142.06 170.00
## 617 0.00 0.00
## 623 233.06 350.00
## 626 69.88 71.40
## 628 189.63 189.63
## 630 484.50 484.50
## 631 103.16 71.40
## 638 173.92 150.00
## 639 544.58 500.00
## 641 233.16 200.00
## 642 544.58 425.00
## 646 500.00 500.00
## 649 102.66 212.25
## 657 200.00 200.00
## 658 300.01 250.00
## 660 30.00 30.00
## 676 200.00 200.00
## 680 231.60 300.00
## 681 200.00 200.00
## 685 200.83 50.00
## 692 0.00 0.00
## 698 0.00 0.00
## 701 39.78 100.00
## 705 287.50 287.50
## 706 132.13 100.00
## 713 178.06 150.00
## 715 92.00 92.00
## 727 60.68 28.00
## 728 200.00 200.00
## 736 157.08 157.08
## 739 222.79 200.00
## 744 230.60 250.00
## 749 439.36 450.00
## 751 201.30 300.00
## 755 640.11 600.00
## 756 500.00 500.00
## 758 148.90 100.00
## 764 322.72 400.00
## 771 659.89 700.00
## 774 355.80 400.00
## 775 310.59 300.00
## 777 265.85 400.00
## 778 276.03 300.00
## 784 -130.00 -130.00
## 785 100.00 100.00
## 787 114.36 100.00
## 790 50.00 50.00
## 791 150.00 150.00
## 793 -150.00 -150.00
## 800 150.00 150.00
## 803 39.78 100.00
## 804 200.00 200.00
## 806 233.06 250.00
## 811 100.00 100.00
## 813 142.06 150.00
## 830 185.58 104.72
## 831 69.88 71.40
## 835 600.00 600.00
## 838 816.30 816.30
## 839 816.30 816.30
## 841 423.60 423.60
## 842 300.00 300.00
## 845 222.19 100.00
## 848 200.00 200.00
## 852 171.62 300.00
## 863 54.82 150.00
## 865 229.53 300.00
## 869 114.36 0.00
## 873 0.00 0.00
## 875 50.00 50.00
## 877 0.00 0.00
## 902 205.00 260.00
## 904 142.57 150.00
## 905 39.78 100.00
## 909 0.00 0.00
## 914 142.57 100.00
## 920 210.00 210.00
## 924 620.15 650.00
## 925 118.37 150.00
## 926 60.00 60.00
## 929 500.00 500.00
## 930 75.00 75.00
## 937 390.92 400.00
## 938 300.00 300.00
## 949 151.10 200.00
## 964 100.00 100.00
## 965 150.00 150.00
## 975 164.86 200.00
## 977 409.08 400.00
## 978 247.22 200.00
## 981 131.72 100.00
## 983 80.89 71.40
## 986 40.32 52.36
## 992 3.04 3.04
## 998 608.80 500.00
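A table of predicted against actual values like the one above is easier to judge when it is reduced to one or two error measures computed on rows that were not used for fitting. The following is a minimal sketch of that workflow on the built-in `mtcars` data, which stands in for the report's `md` data frame; the 80/20 split, the seed and the formula are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed so that the split is reproducible

# Stand-in data: mtcars replaces the report's `md` data frame.
n     <- nrow(mtcars)
train <- sample(n, size = round(0.8 * n))

# Fit on the training rows only.
fit <- glm(mpg ~ wt + hp, data = mtcars[train, ])

# Predict on the held-out rows and compare with the observed values.
pred   <- predict(fit, newdata = mtcars[-train, ])
actual <- mtcars$mpg[-train]
cmp    <- data.frame(predicted = round(pred, 2), actual = round(actual, 2))

rmse <- sqrt(mean((pred - actual)^2))  # root mean squared error
mae  <- mean(abs(pred - actual))       # mean absolute error

cmp
c(rmse = rmse, mae = mae)
```

Naming the columns of the comparison data frame explicitly also avoids auto-generated headers. With high-cardinality factors such as destination_city, held-out rows may contain levels that never appear in the training rows, in which case `predict()` stops with an error; this is a further argument for tidying those variables first.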